1 Introduction:

Anyone who has walked through NYC streets during winter knows the risk of stepping into an ice-covered pothole that will ruin the day in an instant. Car drivers also suffer from bad road conditions, and the city is responsible for keeping the streets in good condition. Our story is motivated by an article from the New York Times:

Report Reveals New York City Paid $138 Million in Settlements Related to Potholes on Roadways.
— The New York Times

This article made us think that street defects are a relevant problem in our daily lives: they may cause flat tires, twist (and, if it's winter, freeze) ankles, and cost the city significant money in damage repairs. Here are two examples of street defects:

Images: Pothole; Cave-in

Most complaints related to street conditions are processed by the Department of Transportation (DOT). Hence, we decided to analyze their data from 2015. Our project is structured in the following way:

After subsetting the relevant records from the 311 dataset, we picked the top eight descriptors that we believed were related to street conditions within our new DOT dataset and made a density plot.

According to this plot, Pothole and Street Light Out seem to be the most common complaints in NYC.

2 Relevance of defects:

The first step of our analysis consisted of identifying the relevance of street defects. In general, street defects have a wide range of negative consequences, chiefly pedestrian and vehicle accidents. We decided to focus our study on the relationship between “DOT complaints about street defects” and “motor vehicle collisions” in New York City during 2015. We used two datasets: “DOT_Complaints”, which we subset from the 311 service dataset, and “NYPD-Motor-Vehicle-Collisions”, which we found on NYC Open Data. When processing the datasets, we found that some of the contributing factors recorded for accidents in the second dataset are clearly not related to street conditions (for example, “Alcohol Involvement” and “Vehicle Defective”). To make our study more accurate, we kept only the main causes that we consider to be related to street defects: “Obstruction/Debris”, “Other Lighting Defects”, “Pavement Defective”, and “Pavement Slippery”.
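The filtering step described above can be sketched as follows. This is only an illustration: the four factor names come from the text, but the record layout and the field names `factor_1` and `factor_2` are assumptions, not the real column names of the collisions dataset.

```python
# Contributing factors treated as street-defect related (taken from the text above).
STREET_DEFECT_FACTORS = {
    "Obstruction/Debris",
    "Other Lighting Defects",
    "Pavement Defective",
    "Pavement Slippery",
}


def is_street_related(collision, factor_fields=("factor_1", "factor_2")):
    """Return True if any contributing-factor field names a street-defect cause."""
    return any(collision.get(field) in STREET_DEFECT_FACTORS for field in factor_fields)


# Hypothetical records; the field names are illustrative only.
collisions = [
    {"id": 1, "factor_1": "Alcohol Involvement", "factor_2": None},
    {"id": 2, "factor_1": "Pavement Slippery", "factor_2": None},
    {"id": 3, "factor_1": "Vehicle Defective", "factor_2": "Pavement Defective"},
]

street_related = [c for c in collisions if is_street_related(c)]
print([c["id"] for c in street_related])
```

A collision is kept as soon as one of its factor fields matches, so an accident with both a vehicle defect and a pavement defect still counts as street related.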

2.1 General Gif Map

To begin with, we made a GIF map to show the distribution of the complaints and accidents that happened in NYC during 2015. In this plot, each frame represents one month. The colored background represents the areas with the most to the least DOT complaints (from red to yellow, respectively). On top of it, we plotted blue density lines to show the distribution of accidents.

Accidents and street defects: GIF

As we can see, the distribution of the two events keeps changing over time, but both seem to be centered around the same areas. From here, we break down our study of accidents into two dimensions: spatial and temporal.

2.3 Complaints & Accidents temporal analysis:

We start our analysis with a daily time series plot. Here, dashed lines represent complaints and solid lines represent accidents. We also chose a scaled count instead of the actual count to make the plot readable, given that there are more complaints than accidents every day.

From the plot, we can see that both events have a peak in mid-March, which could suggest that they are also related in time. However, the increase in complaints actually happened after the increase in accidents, which could be explained by the fact that people probably tend to complain about street defects only after they have had an accident.
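One way to put the two series on a comparable scale is min-max scaling, sketched below. The text does not say which scaling was used, so min-max is an assumption here (standardizing to z-scores would serve the same purpose), and the daily counts are made-up placeholder values.

```python
def min_max_scale(counts):
    """Rescale a series of counts to the range [0, 1]."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # A constant series has no spread; map every value to 0.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


# Placeholder daily counts, not real data.
complaints = [120, 300, 210]
accidents = [4, 10, 7]

print(min_max_scale(complaints))
print(min_max_scale(accidents))
```

After scaling, both series share the same vertical range, so their peaks can be compared on one axis even though raw complaint counts are much larger.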

3 Cause of Defects

In this section, we explore the causes of street defects. We concentrate on three aspects: amount of snowfall, presence of heavy vehicles, and estimated amount of traffic.

3.1 Snowfall

In addition to forming ice layers that cause vehicles to skid out of control and have more accidents on average, snow and freezing temperatures harm roads in the following way: when freezing temperatures occur, the water that had previously entered the ground freezes, causing an expansion that breaks the asphalt. Therefore, we expect road defects and snowfall to have a similar trend. To find out the relationship between snowfall and street defects, we used the 2010-2015 annual snowfall data from the Central Park meteorological observation station.

In the graph above, we plotted the annual snowfall, road-related 311 complaints, and street-light-related 311 complaints in New York City from 2010 to 2015. The green line shows that snowfall reached its highest point in 2011, decreased in 2012, and increased again in 2013 and 2014. The red line, representing the trend in road-related complaints, reveals a strong positive correlation with snowfall, which is consistent with our expectation: it also peaked in 2011, dropped sharply in 2012, then increased dramatically in 2013 and 2014. Street-light-related complaints do not show significant fluctuations from 2010 to 2015, which suggests that they are not related to the weather.
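The visual comparison above could be backed by a correlation coefficient. The sketch below computes Pearson's r between two yearly series; it is not part of the original analysis, and the numbers in the usage lines are placeholders rather than the actual snowfall or complaint figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Placeholder yearly values (one entry per year), not real data.
snowfall_placeholder = [10.0, 30.0, 5.0, 20.0]
complaints_placeholder = [100.0, 280.0, 60.0, 190.0]
r = pearson(snowfall_placeholder, complaints_placeholder)
print(round(r, 3))
```

With only six yearly observations, such a coefficient would be a rough summary at best, so it complements the plot rather than replacing it.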

3.2 Truck

As defined in the New York City Traffic Rules, a truck is “any vehicle designed for the transportation of property that has the characteristics: two axles and six tires, or three or more axles”. Research indicates that heavy vehicles (including trucks) contribute substantially to road and bridge damage such as potholes. To maintain good road conditions, trucks are only allowed to travel on certain roads in New York City. In this section, we used the NYC Truck Routes data from the New York City Department of Transportation.

We plotted truck routes in New York City and the relevant 311 complaints in the above graph. The blue lines represent truck routes, the black circles represent pothole 311 complaints, and the brown circles represent cave-in 311 complaints. After careful observation, we found an interesting phenomenon: while many black circles are on or near blue lines, brown circles are far from those truck routes. This is because cave-ins are usually caused by structural characteristics of the ground, and not so much by the traffic that goes through the street. To measure the relationship between trucks and defects objectively, we computed, for each defect, its distance to the closest truck route. We then plotted the distribution of these distances to see how close, on average, each defect is to a truck route. Lower distances suggest that defects form on or around areas where trucks usually circulate. The distribution is the following:
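The closest-distance computation can be sketched as a point-to-polyline distance. The sketch assumes that routes are lists of (x, y) vertices and that all coordinates are already in a planar projection (for example meters), so Euclidean distance is meaningful; the function names are illustrative and the original analysis may have used a GIS library instead.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_nearest_route(point, routes):
    """Smallest distance from a defect location to any segment of any route."""
    return min(
        point_segment_distance(point, a, b)
        for route in routes
        for a, b in zip(route, route[1:])
    )


# Hypothetical routes and defect location in planar coordinates.
routes = [[(0, 0), (10, 0)], [(0, 5), (0, 20)]]
print(distance_to_nearest_route((5, 2), routes))
```

Computing this value for every pothole and every cave-in yields the two samples whose densities are compared in the plot.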

This density graph is consistent with our conjecture. Compared to cave-ins, potholes are more concentrated around truck routes. To maintain good road conditions, we suggest that the Department of Transportation take more measures to protect the truck routes, for example, using new pothole-resistant paving materials and maintaining such roads more frequently.

It is important to note that these truck routes include “Through-roads” and “Local-roads”. This means that no truck is allowed to circulate outside these roads unless it is headed directly to or from a destination outside these designated areas.

3.3 Traffic

In addition to heavy vehicles, constant traffic, even of smaller vehicles, can damage the roads as well. To explore the relationship between traffic and road damage (using DOT complaints as a proxy), we combined 311 complaint data with 2015 Manhattan traffic open data. The traffic data contains traffic counts by time of day for several road segments in the city. We first transformed the data into geographic coordinates, then mapped these coordinates to a zip code, and finally aggregated traffic by zip code to get an estimate of the number of cars that circulate in each zip code every day.
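The final aggregation step can be sketched as below. The sketch assumes each road segment has already been assigned a zip code (the geocoding step is not shown), and the field names `zip` and `hourly_counts` as well as the counts themselves are hypothetical.

```python
from collections import defaultdict


def daily_traffic_by_zip(segments):
    """Sum the hourly vehicle counts of all segments that fall in each zip code."""
    totals = defaultdict(int)
    for segment in segments:
        totals[segment["zip"]] += sum(segment["hourly_counts"])
    return dict(totals)


# Hypothetical segments with shortened lists of hourly counts.
segments = [
    {"zip": "10001", "hourly_counts": [10, 20]},
    {"zip": "10001", "hourly_counts": [5]},
    {"zip": "10027", "hourly_counts": [7, 3]},
]
print(daily_traffic_by_zip(segments))
```

Zip codes with more counted segments will get larger totals under this scheme, so an average per segment may be a fairer estimate when segment coverage is uneven.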


From the plots above, we can see that the correlation between traffic and road condition complaints is not obvious. However, we can still observe some positive relationship between the two. This is reasonable, since traffic is only one of many factors behind road condition complaints. Additionally, it could suggest that the DOT does a good job of maintaining areas with more traffic.

4 Response time

The objective of this section is to measure the response time for street defects, as reported in the 311 dataset. We measure the response time as (“Closed date” - “Created date”). We started with an exploratory analysis, to describe how relevant factors are correlated with response time. Based on our results, we built a predictive model to try to determine an estimated response time according to different attributes of each street defect complaint. This predictive model could be useful to the DOT in order to help them be more self-aware of their service quality.
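The response time definition above can be sketched as follows. The timestamp format is an assumption about how the dates are stored in the export, and treating a missing closed date as "no response time" is also an assumption; the original handling of open complaints is not described.

```python
from datetime import datetime

# Assumed timestamp format, e.g. "03/01/2015 09:00:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def response_time_days(created, closed, fmt=DATE_FORMAT):
    """Return the closed date minus the created date in days, or None if still open."""
    if not closed:
        return None
    delta = datetime.strptime(closed, fmt) - datetime.strptime(created, fmt)
    return delta.total_seconds() / 86400  # 86400 seconds per day


print(response_time_days("03/01/2015 09:00:00 AM", "03/03/2015 09:00:00 PM"))
```

Expressing the difference as fractional days keeps complaints that are closed within hours distinguishable from those closed the next day.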

4.1 Exploratory analysis

The response time varied significantly within a wide range going from a few hours to over a year (435 days). The distribution, however, is mostly centered around 2-5 days as shown in the graph below:

To understand which factors help determine the response time, we broke it down by type of street defect and analyzed how temporality, location, and socioeconomic variables are correlated with it.

4.1.1 Breakdown by type of complaint:

As we can see, potholes, street lights out, and cave-ins have the largest average response time. However, overall there doesn’t seem to be an important difference in average response time between different types of complaints. Variance, on the other hand, seems to be larger for potholes, cave-ins, and street light cycling. The variance in potholes (and, to a certain extent, cave-ins as well) was expected, given that there are very different types of potholes, and larger potholes are likely to be prioritized over smaller ones.

The next step is to explore response time across temporality.

4.1.2 Temporality analysis:

Before looking at the data, we would have expected that response time would be higher for weekends and for winter, when we know there are more defects, and also repairs are probably harder to make because of the harsh weather.

First, we analyzed response time by season. As we mentioned, we expected winter to have the highest response time. However, we were (positively) surprised to see that this is not the case. Winter does seem to have a relatively high variance, but not a much higher average.

To continue our analysis, we computed the distribution of the response time by day of the week. Again, our initial hypothesis that weekends would have a higher average response time was not supported by the data.

The response time distribution by day of the week is as follows:

To conclude our exploratory temporality analysis, we did one more breakdown by time of day. We found that complaints that come in after noon have a slightly higher average response time than complaints that come in during the morning. This makes sense because complaints received in the afternoon or evening are probably more likely to be handled the next day. The same reason explains the higher variability in afternoon complaints.

4.1.3 Geographical analysis.

In addition to exploring temporality variations, we explored how response time varies according to boroughs and zip codes. Moreover, we included information about each zip code’s income level to understand the relationship between these variables.

The first finding was, unsurprisingly, that Manhattan has the best response time among all New York boroughs. This could be explained by a number of reasons. Manhattan is the busiest of all boroughs, and it also has the highest income per capita. Another likely reason is that distances are relatively short and it is probably easy for DOT repair teams to reach street defect locations faster.

An interesting observation from the response time by borough is the difference in variability. Brooklyn seems to have a very high variance in response time, while Manhattan and Staten Island are more concentrated around their means. The Bronx and Queens seem to be in a middle ground in terms of variability. While we found this difference, we did not dig deeper to explore its potential causes. We did, however, explore a potential cause for the difference in average response time.

As mentioned before, one of the factors we think could be related to response time is income. Richer areas tend to have more visibility and influence over government overall. Therefore, we would expect response time to improve as income increases. To do an income analysis, we computed the average complaint response time by zip code and joined it with the median income data from the US Census Bureau.
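The aggregation and join can be sketched as follows. The field names `zip` and `days` and all values are hypothetical, and dropping zip codes that have no income figure is an assumption about how unmatched rows were handled.

```python
import math
from collections import defaultdict


def mean_response_by_zip(complaints):
    """Average response time (in days) per zip code."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for complaint in complaints:
        sums[complaint["zip"]] += complaint["days"]
        counts[complaint["zip"]] += 1
    return {zip_code: sums[zip_code] / counts[zip_code] for zip_code in sums}


def join_with_income(avg_by_zip, median_income):
    """Inner join on zip code; returns (zip, log median income, mean response time)."""
    return [
        (zip_code, math.log(median_income[zip_code]), avg)
        for zip_code, avg in sorted(avg_by_zip.items())
        if zip_code in median_income
    ]


# Made-up values for illustration only.
complaints = [
    {"zip": "10001", "days": 2.0},
    {"zip": "10001", "days": 4.0},
    {"zip": "11201", "days": 5.0},
]
median_income = {"10001": 80000, "11201": 60000}

avg_by_zip = mean_response_by_zip(complaints)
rows = join_with_income(avg_by_zip, median_income)
print(rows)
```

Each resulting row is one point in a scatter plot of log median income against average response time.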

In the plot below, we can observe the relationship between median income (in log scale), average response time, and borough:

Interestingly, most boroughs present a negative correlation between median income and response time. The only exception to this relationship is Queens, where there seems to be a positive correlation. In all cases, however, the slope does not seem to be very steep. In addition to median income, we did the same analysis with median rent compared to response time. We mapped the results to visualize how income and response time are geographically distributed, and see how they are correlated:

4.2 Predictive model:

To conclude our response time analysis, we decided to build a predictive model with the variables we have discussed so far.

4.2.1 The model.

Our dependent variable is response time (continuous) and our predictors are:

  • Time of complaint: Early morning (0:00-5:59 am), morning (6:00-11:59 am), afternoon (12:00-5:59 pm), and evening (6:00 – 11:59 pm)
  • Season
  • Log(median income)
  • Borough
  • Defect type
  • Interaction between borough and income
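The time-of-complaint predictor can be derived from the hour at which a complaint was created. A minimal sketch using the four windows listed above follows; the label strings mirror the coefficient names in the regression output, and the function name is illustrative.

```python
def time_of_day_bucket(hour):
    """Map an hour of the day (0-23) to one of the four time-of-complaint categories."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 6:
        return "early_morning"  # 0:00-5:59 am
    if hour < 12:
        return "morning"        # 6:00-11:59 am
    if hour < 18:
        return "afternoon"      # 12:00-5:59 pm
    return "evening"            # 6:00-11:59 pm


print([time_of_day_bucket(h) for h in (0, 6, 12, 18)])
```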

We ran a linear regression to understand the signs and magnitudes of the correlations of these variables with response time. The results are summarized in the following table.

                                          Estimate   Std. Error   t value   Pr(>|t|)
time_dayearly_morning                     -0.05231   0.006294     -8.31     9.685e-17
time_dayevening                           0.01395    0.004903     2.846     0.004431
time_daymorning                           -0.09257   0.007844     -11.8     4.001e-32
seasonSpring                              -0.1789    0.005402     -33.12    2.041e-239
seasonSummer                              0.1391     0.006523     21.32     1.135e-100
seasonFall                                -0.1788    0.006823     -26.21    5.946e-151
log(median_income)                        -0.1271    0.00481      -26.41    2.666e-153
DescriptorDefective Hardware              -0.5472    0.01238      -44.19    0
DescriptorFailed Street Repair            -0.7633    0.01405      -54.31    0
DescriptorPothole                         -0.2383    0.008633     -27.6     3.749e-167
DescriptorRough, Pitted or Cracked Roads  -0.3269    0.01264      -25.86    5.084e-147
DescriptorStreet Light Cycling            -1.249     0.01582      -78.96    0
DescriptorStreet Light Out                -0.2735    0.008972     -30.48    3.151e-203
boroughBrooklyn                           0.9542     0.007236     131.9     0
boroughManhattan                          -1.067     0.009438     -113.1    0
boroughQueens                             -0.2841    0.007109     -39.97    0
boroughStaten                             0.2478     0.008847     28.01     4.88e-172
(Intercept)                               4.431      0.04018      110.3     0

Fitting linear model: response.Time ~ time_day + season + log(median_income) + Descriptor + borough

Observations   Residual Std. Error   R^2      Adjusted R^2
115918         0.6866                0.5218   0.5217

By inspecting the summary table for our regression, we observe that all of our variables are statistically significant. This significance is, however, probably due to the large number of observations, which reduces the estimates’ variance significantly.

On the other hand, it is worth noting that the largest coefficient estimates are those associated with borough and with type of defect. When controlling for other variables, cave-ins (base case) seem to have the highest response time, while light cycling has the lowest. The difference between these two is around 1.25 days.

In terms of boroughs, as we saw earlier, Manhattan has the lowest response time, followed by Queens, Bronx, Staten Island, and Brooklyn. The difference between Manhattan (lowest), and Brooklyn (highest) response time, was about 2 days, which is around 66% of the overall average response time.

Seasonally speaking, fall (seasonFall) and spring (seasonSpring) have the lowest response time, although the variation is not too big. The difference between the lowest seasonal coefficient (spring) and the highest (summer, in the table above) is only around 0.3 days, which is around 10% of the overall average response time.

4.2.2 Measuring performance

Finally, we tested how well our model predicts response time by splitting the data into a training set and a testing set (75% and 25%, respectively). Since we are predicting a continuous variable, we needed a way to measure performance that is easy to interpret. With this in mind, we computed the absolute value of each residual as a percentage of the true response time and averaged it across the whole testing set.

To evaluate this measure, we compare our average prediction error with the error that would be obtained by simply predicting the average response time (which is 3 days) for every case.

Our resulting prediction error was 18.04%, which is an improvement of 12 percentage points when compared against the simple average prediction, which was 29.69%.
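The evaluation procedure can be sketched as follows. This is a minimal illustration of the split, the error measure, and the baseline; the function names and the random seed are assumptions, and using the training-set mean as the baseline prediction is also an assumption, since the text does not say which mean was used.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mean_abs_pct_error(actual, predicted):
    """Average of |actual - predicted| / actual, as a percentage.

    Assumes every actual response time is strictly positive.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty series of equal length")
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def baseline_error(train_actual, test_actual):
    """Error obtained by always predicting the mean response time of the training set."""
    mean_train = sum(train_actual) / len(train_actual)
    return mean_abs_pct_error(test_actual, [mean_train] * len(test_actual))


train, test = train_test_split(range(8))
print(len(train), len(test))
print(mean_abs_pct_error([2.0, 4.0], [1.0, 5.0]))
```

Because each error is divided by the true response time, complaints resolved within a few hours can dominate this measure, which is worth keeping in mind when reading the percentages above.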

5 Conclusions

Street defects are very relevant in any city, especially in a place like NYC, where several factors contribute to their appearance. A large amount of traffic (both light and heavy vehicles) and the extreme weather conditions observed throughout the year make New York City streets especially vulnerable to such problems. The consequences of these defects are varied, ranging from twisted ankles and flat tires to serious life-threatening accidents and multi-million dollar claims that cost taxpayers a significant amount of money each year. Moreover, there is a significant amount of damage that is not accounted for. A large proportion of people affected by street defects probably do not report their damage, which makes it very difficult to estimate the real costs of street defects accurately.

To improve their street defect repairs, the DOT could prioritize maintenance jobs in areas where trucks are allowed. Also, they should keep redoubling their efforts during winter and rainy periods, which, according to our findings, they are probably already doing.

Furthermore, the DOT should consider reviewing their internal processes to help improve response time in lower income areas, since response times currently seem to favor higher income neighborhoods.